Goto

Collaborating Authors

 contribution ratio



How does Architecture Influence the Base Capabilities of Pre-trained Language Models? A Case Study Based on FFN-Wider and MoE Transformers

Neural Information Processing Systems

Pre-trained language models have been proven to possess strong base capabilities, which not only excel in in-distribution language modeling but also show powerful abilities in out-of-distribution language modeling, transfer learning and few-shot learning. Unlike existing work focusing on the influence of scale on base capabilities, our work examines the influence of architecture on those. Specifically, our concern is: How does architecture influence the base capabilities of pre-trained language models? In this work, we attempt to explain and reverse the decline in base capabilities caused by the architecture of FFN-Wider Transformers, seeking to provide some insights. Through analysis, we found the contribution ratio of Multi-Head Attention (a combination function) to pre-trained language modeling is a key factor affecting base capabilities.


How does Architecture Influence the Base Capabilities

Neural Information Processing Systems

Unlike existing work focusing on the influence of scale on base capabilities, our work examines the influence of architecture on those. Specifically, our concern is: How does architecture influence the base capabilities of pre-trained language models?


How does Architecture Influence the Base Capabilities of Pre-trained Language Models? A Case Study Based on FFN-Wider and MoE Transformers

Neural Information Processing Systems

Pre-trained language models have been proven to possess strong base capabilities, which not only excel in in-distribution language modeling but also show powerful abilities in out-of-distribution language modeling, transfer learning and few-shot learning. Unlike existing work focusing on the influence of scale on base capabilities, our work examines the influence of architecture on those. Specifically, our concern is: How does architecture influence the base capabilities of pre-trained language models? In this work, we attempt to explain and reverse the decline in base capabilities caused by the architecture of FFN-Wider Transformers, seeking to provide some insights. Through analysis, we found the contribution ratio of Multi-Head Attention (a combination function) to pre-trained language modeling is a key factor affecting base capabilities.


How does Architecture Influence the Base Capabilities of Pre-trained Language Models? A Case Study Based on FFN-Wider Transformer Models

Lu, Xin, Zhao, Yanyan, Qin, Bing

arXiv.org Artificial Intelligence

Pre-trained language models have been proven to possess strong base capabilities, which not only excel in in-distribution language modeling but also show powerful abilities in out-of-distribution language modeling, transfer learning and few-shot learning. Unlike existing work focusing on the influence of scale on base capabilities, our work examines the influence of architecture on those. Specifically, our concern is: How does architecture influence the base capabilities of pre-trained language models? In this work, we attempt to explain and reverse the decline in base capabilities caused by the architecture of FFN-Wider Transformers, seeking to provide some insights. Through analysis, we found the contribution ratio of Multi-Head Attention (a combination function) to pre-trained language modeling is a key factor affecting base capabilities. FFN-Wider Transformers reduce the contribution ratio of this combination function, leading to a decline in base capabilities. We confirmed this by experiments and proposed Combination Enhancement Architecture (CEA) to address the decline in base capabilities of such models. Significantly, we extended our explanation and CEA to Mixture of Experts (MoE) architecture Transformers, which also alleviated their decline in base capabilities to some extent, proving our work can offer useful guidance for architecture analysis, architecture improvement and architecture design.


Dataset Structural Index: Leveraging a machine's perspective towards visual data

Parikh, Dishant

arXiv.org Artificial Intelligence

But when it came to visual datasets, the field immediately stepped towards the algorithmic side. One of the fundamental reasons was the amount of information needed to translate from an image. But with the introduction of convolutional networks and transfer learning [1], [2], [3], it is possible to convert an image or a visual object into feature vectors without losing too much information about the entity under concern. It defined a way to use feature maps to compare and distinguish one visual object from another [4]. There has been a lot of work in using these feature vector conversions in systems like content-based image retrievals [5], using feature vectors as representations of different scenarios [6], [7]. It is critical to understand that there is a difference between the way a machine looks at the data and the way we do. There are scenarios in which the interpretation through features is a little different from the interpretation of humans. DSI is there to bridge the gap and understand the machine's perspective before molding it to shape better architectures, in turn, better model performances. I think two concepts could be linked together to understand a machine's viewpoint while working with visual


Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models

Aoshima, Makoto, Yata, Kazuyoshi

arXiv.org Machine Learning

We consider classifiers for high-dimensional data under the strongly spiked eigenvalue (SSE) model. We first show that high-dimensional data often have the SSE model. We consider a distance-based classifier using eigenstructures for the SSE model. We apply the noise reduction methodology to estimation of the eigenvalues and eigenvectors in the SSE model. We create a new distance-based classifier by transforming data from the SSE model to the non-SSE model. We give simulation studies and discuss the performance of the new classifier. Finally, we demonstrate the new classifier by using microarray data sets.